Support multi-tenant RAM buffers for IndexWriter #13951
base: main
Conversation
Thanks for looking into it. It feels like there should be closer integration between this and the existing FlushPolicy/FlushByRamOrCountsPolicy?

I think we'll also want to look into the performance overhead of this. It's likely not good to iterate over tens of IndexWriters on every indexed document to check if the total buffer usage is above the limit?
   * @throws IOException if the directory cannot be read/written to, or if it does not exist and
   *     <code>conf.getOpenMode()</code> is <code>OpenMode.APPEND</code> or if there is any other
   *     low-level IO error
   */
  public IndexWriter(Directory d, IndexWriterConfig conf) throws IOException {
  public IndexWriter(
      Directory d, IndexWriterConfig conf, IndexWriterRAMManager indexWriterRAMManager)
It should be on the IndexWriterConfig rather than another IndexWriter ctor argument.
Sure, that makes sense to me.
  /**
   * For managing multiple instances of {@link IndexWriter} sharing the same buffer (configured by
   * {@link IndexWriterConfig#setRAMBufferSizeMB})
It should be the other way around in my opinion, the RAM buffer size should be on IndexWriterRAMManager, and setting a ramBufferSizeMB on IndexWriterConfig would internally create a new IndexWriterRAMManager under the hood that is shared with no other IndexWriter.
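To make the suggestion concrete, here is a minimal sketch of that ownership model; the class and method names (MyIndexWriterConfig, setIndexWriterRAMManager) are stand-ins for illustration and not the PR's actual API:

```java
// Sketch only: hypothetical shapes, not the actual Lucene classes.
class IndexWriterRAMManager {
  private final double ramBufferSizeMB; // the RAM limit lives on the manager

  IndexWriterRAMManager(double ramBufferSizeMB) {
    this.ramBufferSizeMB = ramBufferSizeMB;
  }

  double getRamBufferSizeMB() {
    return ramBufferSizeMB;
  }
}

class MyIndexWriterConfig {
  private IndexWriterRAMManager ramManager;

  // Setting ramBufferSizeMB internally creates a manager that is shared
  // with no other IndexWriter.
  MyIndexWriterConfig setRAMBufferSizeMB(double mb) {
    this.ramManager = new IndexWriterRAMManager(mb);
    return this;
  }

  // Explicitly passing a manager lets several writers share one budget.
  MyIndexWriterConfig setIndexWriterRAMManager(IndexWriterRAMManager manager) {
    this.ramManager = manager;
    return this;
  }
}
```

With that shape, configs built from the same manager share one RAM budget, while a plain setRAMBufferSizeMB call keeps today's single-writer behavior.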
Sorry, I'm probably missing something here - so I get what you're saying about having the IndexWriterRAMManager in the IndexWriterConfig, but what would be the point of creating an IndexWriterRAMManager for a single IndexWriter? Wouldn't DocumentWriterFlushControl be sufficient for this case?
> but what would be the point of creating an IndexWriterRAMManager for a single IndexWriter

I think the idea is to be able to create IndexWriters that don't share their RAM buffer limit with other writers. Maybe we could just set IndexWriterRAMManager to null if ramBufferSizeMB is explicitly specified in the IW config (instead of creating a new ram manager).
Hmm, I took a slightly different approach, will publish a new PR soon. Maybe we can discuss it there, but pretty much I just do what @jpountz suggested and move the ramBufferSizeMB to be a value held by the IndexWriterRAMManager. I think we can then discuss if we should just disable calling writer flushes in the manager if there is only a single writer.
  /**
   * Chooses which writer should be flushed. Default implementation chooses the writer with most RAM
   * usage
FWIW we ran benchmarks in Elasticsearch, and the approach that worked the best was to flush IndexWriters in a round-robin fashion. This is not intuitive at first sight, but I believe that it works better in practice because it is more likely to flush DWPTs that are little used, and also because otherwise you favor IndexWriters that do little indexing over IndexWriters that do heavy indexing (and are thus more likely to have a large buffer).
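For illustration only, a round-robin choice compatible with a chooseWriterToFlush-style hook could look roughly like this; it is a sketch under the assumption that writers are kept in a stable order, and Writer is a placeholder for IndexWriter:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: pick writers in a fixed rotation instead of picking the one with the
// largest RAM buffer, so lightly-used writers still get flushed periodically
// and heavy indexers are not penalized for having big buffers.
class RoundRobinChooser<Writer> {
  private final AtomicInteger next = new AtomicInteger();

  Writer chooseWriterToFlush(List<Writer> writers) {
    if (writers.isEmpty()) {
      return null;
    }
    // floorMod keeps the index valid even after the counter overflows.
    int index = Math.floorMod(next.getAndIncrement(), writers.size());
    return writers.get(index);
  }
}
```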
Hmm, that's interesting! I can make that the default implementation then.

So I actually had the same thought, but when I took a look at it I couldn't think of a clean way to integrate the two... but I'll give it some more thought.
Ah I thought that the

For what it's worth, these classes are package-private, so we can feel free to change their API.
   * @return the IndexWriter to flush
   */
  protected static IndexWriter chooseWriterToFlush(
      Collection<IndexWriter> writers, IndexWriter callingWriter)
Do we need the calling writer here?
I don't use it, but I left it there in case someone wants to override the FlushPolicy in such a way that they might need the calling writer, i.e. just flush the calling writer every time.
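As a hedged illustration of that kind of override (the class shape is hypothetical; the PR's actual method is static, so the real extension point may look different):

```java
import java.util.Collection;

// Hypothetical override sketch: ignore RAM usage entirely and always flush the
// writer that triggered the check. Writer stands in for IndexWriter.
class FlushCallingWriterChooser<Writer> {
  protected Writer chooseWriterToFlush(Collection<Writer> writers, Writer callingWriter) {
    return callingWriter;
  }
}
```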
"Writer " + id + " has not been registered or has been removed already"); | ||
} | ||
long totalRam = 0L; | ||
for (IndexWriter writer : idToWriter.values()) { |
Would it make sense to cache ramBytesUsed() and only update it for the calling writer?
Yeah, this makes sense, I didn't realize that ramBytesUsed() is not necessarily a cheap operation. Changed it in the next rev.
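A sketch of what that caching could look like (a hypothetical helper, not the PR's code; the delta bookkeeping here is simplified and not fully race-free):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Sketch: remember each writer's last reported RAM usage so that indexing a doc
// in one writer only refreshes that writer's entry instead of calling
// ramBytesUsed() on every registered writer.
class CachedRamAccounting<Writer> {
  private final Map<Writer, Long> lastReported = new ConcurrentHashMap<>();
  private final LongAdder totalBytes = new LongAdder();

  // Called by the writer that just indexed or deleted a document.
  void update(Writer writer, long currentRamBytesUsed) {
    Long previous = lastReported.put(writer, currentRamBytesUsed);
    totalBytes.add(currentRamBytesUsed - (previous == null ? 0L : previous));
  }

  void remove(Writer writer) {
    Long previous = lastReported.remove(writer);
    if (previous != null) {
      totalBytes.add(-previous);
    }
  }

  long totalRamBytesUsed() {
    return totalBytes.sum();
  }
}
```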
Thanks for all the comments guys, I've been pretty busy with some life things/work, but hopefully I'll put out another update by tomorrow!
Sorry it took so long to get to this PR. I have some code refactoring comments that should help simplify the changes.
    }
  }

  private static class LinkedIdToWriter {
How about using a Java Queue implementation instead of the custom linked-list logic? You could round-robin on elements by removing, processing, and adding them back to the queue. I suppose this queue size would be small, so array deque and linked lists are both fine? We can also get some thread-safe implementations out of the box.
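A small sketch of the queue-based rotation being suggested (illustrative only; the flush trigger and total-RAM bookkeeping are out of scope):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch: keep writers in a queue and round-robin by polling the head and
// re-adding it at the tail once it has been asked to flush.
class WriterRotation<W> {
  private final Queue<W> queue = new ArrayDeque<>();

  synchronized void register(W writer) {
    queue.add(writer);
  }

  synchronized void unregister(W writer) {
    queue.remove(writer);
  }

  // Head of the queue = the writer that was flushed longest ago.
  synchronized W pollNextToFlush() {
    W next = queue.poll();
    if (next != null) {
      queue.add(next); // rotate to the back
    }
    return next;
  }
}
```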
Hmm, so I looked into using both Queue and LinkedHashMap to do this. The issue I found was that there was still complexity in maintaining the last id that was flushed and the total RAM used, so the only function that really became less complex was the addWriter function. Given that this probably won't simplify the overall function too much, I'm inclined to keep the implementation the same way it is right now, but if you disagree feel free to let me know and I can take a second look at it.
  /**
   * For use in {@link IndexWriter}, manages communication with the {@link IndexWriterRAMManager}
   */
  public static class PerWriterIndexWriterRAMManager {
Do we really need this class? IndexWriter (and FlushPolicy too) should already have a reference to the ram manager via IndexWriterConfig.

How about we just store the writer's "ramManagerID" inside IndexWriter, and use it to invoke ram manager APIs directly? In fact, can we work with the writer object directly instead of keeping these "id" mappings? Similar to how FlushPolicy accepts the calling DWPT in its APIs...
Yeah, I think that makes sense, I'll remove it.
@@ -142,7 +142,21 @@ public IndexWriterConfig() {
   * problem you should switch to {@link LogByteSizeMergePolicy} or {@link LogDocMergePolicy}.
   */
  public IndexWriterConfig(Analyzer analyzer) {
    super(analyzer);
    this(analyzer, new IndexWriterRAMManager(IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB));
Any reason for not making this change in the default constructor? We could avoid making changes to all the tests.
I think the reason I had to make all the changes in the tests was because I added this constructor:

public IndexWriterConfig(IndexWriterRAMManager indexWriterRAMManager) {

And then in the tests there were a bunch of new IndexWriterConfig(null) calls which became ambiguous. I think that this constructor is potentially useful, which is why I took the hit and changed all those tests, but I can remove it to avoid all those test changes?
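A minimal, self-contained reproduction of that ambiguity with stand-in classes (not the real Lucene types) shows why new IndexWriterConfig(null) stops compiling once both single-argument constructors exist:

```java
// Stand-in types to reproduce the overload ambiguity described above.
class Analyzer {}

class IndexWriterRAMManager {}

class Config {
  Config(Analyzer analyzer) {}

  Config(IndexWriterRAMManager ramManager) {}
}

class Demo {
  void example() {
    // Config c = new Config(null);          // does not compile: null matches both
    //                                       // reference-typed overloads equally well
    Config c = new Config((Analyzer) null);  // tests must disambiguate with a cast
  }
}
```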
Just ran some benchmarks from Edit: OK it seems that

Ugh ... "who goes first" bias? Maybe the line docs file was cold on your first run? Though it's surprising that'd make such a big difference overall (21 -> 16) since that usage (reading a single file sequentially) is normally well optimized by OS readahead ... Could you please open a

Sure @mikemccand, added an issue here. I also was able to test my change by just doing two separate runs and manually changing the test to point at whichever version I needed. My test setup involved hacking I realize that this is not a very realistic scenario, so I will try benchmarking with more writers and also skew the indexing rates of each writer to see how the change performs. Edit:
@mdmarshmallow thank you for benchmarking this change! I know that is tricky and I think it's fine that we can't measure a performance gain? Achieving the same performance as N (2 or 5 in your tests) fully separate

The only case I'd expect to see a performance win is if the rate of indexing is very unbalanced across the five shards ... e.g. you have five
Description
Draft PR to outline my initial approach. I introduced IndexWriterRamManager to control writer flushes. I also have a function IndexWriterRamManager#chooseWriterToFlush that lets the user choose which writer they specifically want to be flushed, sort of like a simple flush policy. Currently it defaults to just choosing the writer with the most RAM usage.

One thing I wanted to avoid was starting another thread to just poll the IndexWriter memory usage, so I just added a listener in IndexWriter#maybeProcessEvents which is called whenever docs are updated or deleted (IndexWriterRamManager#flushIfNecessary). I wanted this method to be called whenever FlushPolicy#onChange was called, as I believe they both kinda do the same thing.

No unit tests yet, but will add them! I just wanted to have my approach sanity checked for now.
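A condensed sketch of that flow, assuming ramBytesUsed() and flushNextBuffer() as the writer-side hooks; this illustrates the described approach and is not the PR's actual code:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: no polling thread. The writer that just indexed or deleted documents
// asks the shared manager to re-check total RAM usage, and if the combined usage
// is over the limit, the manager picks one writer and flushes it.
class RamManagerSketch {
  interface Writer {
    long ramBytesUsed();

    void flushNextBuffer();
  }

  private final Set<Writer> writers = ConcurrentHashMap.newKeySet();
  private final long limitBytes;

  RamManagerSketch(long limitBytes) {
    this.limitBytes = limitBytes;
  }

  void register(Writer writer) {
    writers.add(writer);
  }

  // Hooked into the indexing path (the PR calls this from IndexWriter#maybeProcessEvents).
  void flushIfNecessary(Writer callingWriter) {
    long total = 0;
    for (Writer w : writers) {
      total += w.ramBytesUsed();
    }
    if (total > limitBytes) {
      chooseWriterToFlush(callingWriter).flushNextBuffer();
    }
  }

  // Default in this revision: flush the writer currently using the most RAM.
  private Writer chooseWriterToFlush(Writer fallback) {
    Writer best = fallback;
    long bestBytes = -1;
    for (Writer w : writers) {
      long bytes = w.ramBytesUsed();
      if (bytes > bestBytes) {
        bestBytes = bytes;
        best = w;
      }
    }
    return best;
  }
}
```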
Revision 2
Added an IndexWriterRAMManager that is tied to the IndexWriterConfig. The user can either pass in their own if they want multiple IndexWriter instances to share a single buffer, or just let the IndexWriterRAMManager be created per IndexWriter.
Revision 3
Revision 4